2. Definitions

  • intrinsic vs. post-hoc

  • global vs. local explanation

  • workflow

Learning outcomes

  1. Compare the competing definitions of interpretable machine learning, the motivations behind them, and metrics that can be used to quantify whether they have been met.

Reading

  • Lipton, Z. C. (2018). The Mythos of Model Interpretability. ACM Queue: Tomorrow’s Computing Today, 16(3), 31–57. https://doi.org/10.1145/3236386.3241340

  • Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., & Yu, B. (2019). Definitions, methods, and applications in interpretable machine learning. Proceedings of the National Academy of Sciences of the United States of America, 116(44), 22071–22080. https://doi.org/10.1073/pnas.1900654116

Intrinsic Interpretability vs. Post-hoc Explanations

  • Intrinsically interpretable: Build a “glass box” from the start. The model is interpretable by design—its structure allows us to understand how it works.

  • Post-hoc: Inspect an already-trained “black box” model, which may have been chosen simply to maximize accuracy, without regard to interpretability. Post-hoc methods extract explanations from models that weren’t designed to be understood.

Intrinsic interpretability

Some properties that make a model intrinsically interpretable are:

  • Sparsity
  • Simulatability
  • Modularity

We define these on the next few slides.

Sparsity

  • A model is sparse if the number of non-zero parameters is small relative to the total number of available parameters.

  • This interpretability property is motivated by Occam’s razor: the simplest explanation is likely closest to the truth.

  • Sparsity enhances interpretability only when it correctly captures the structure of the true data-generating process. If the true relationship depends on many features, imposing sparsity introduces bias.
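As a minimal sketch (synthetic data, not from the course materials), a lasso fit illustrates sparsity: when only a few features enter the true data-generating process, the L1 penalty drives most coefficients exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
# Only 3 of the 20 features enter the true data-generating process.
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=n)

# The L1 penalty zeroes out most coefficients.
model = Lasso(alpha=0.1).fit(X, y)
n_nonzero = int(np.sum(model.coef_ != 0))
print(f"{n_nonzero} of {p} coefficients are non-zero")
```

With the sparsity assumption satisfied, the surviving coefficients point directly at the relevant features; if the true relationship instead used all 20 features, the same penalty would bias the fit.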

Simulatability

A model is simulatable if a person can manually compute its output for any input in a reasonable amount of time. Both the number of parameters and the complexity of the inference procedure factor in.
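A small linear model is a standard example of a simulatable model: a person can reproduce any prediction with a few multiplications and additions. A toy sketch (synthetic numbers, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0], [2.0, 0.0], [0.0, 1.0], [3.0, 1.0]])
y = np.array([5.0, 4.0, 2.0, 7.0])
model = LinearRegression().fit(X, y)

x_new = np.array([1.0, 1.0])
# "Simulating" the model by hand: coefficient times feature, plus intercept.
by_hand = model.intercept_ + model.coef_[0] * x_new[0] + model.coef_[1] * x_new[1]
by_model = model.predict(x_new.reshape(1, -1))[0]
assert np.isclose(by_hand, by_model)
```

A random forest with hundreds of trees fails this test even though each individual tree passes it.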

Modularity

  • A model is modular if its prediction function \(f(x)\) can be decomposed into interpretable components, each of which can be analyzed independently.

  • One example is an additive decomposition, where the function can be written as

    \[\begin{align*} f\left(x\right) = b_{0} + \sum_{j = 1}^{J}f_{j}\left(x_{j}\right) \end{align*}\]

    and each \(f_{j}\) operates on only a single coordinate of the input \(x\).

  • More generally, a model is modular if subsets of its parameters or computations can be viewed as separate, interpretable units.
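A hand-built additive model makes the idea concrete (toy component functions, chosen only for illustration): each component can be inspected, e.g. plotted over its input, on its own, and the prediction decomposes exactly into per-feature contributions.

```python
import numpy as np

# Additive model f(x) = b0 + f1(x1) + f2(x2), with interpretable parts.
b0 = 1.0
def f1(x1): return 2.0 * x1    # linear effect of feature 1
def f2(x2): return np.sin(x2)  # nonlinear effect of feature 2

def f(x):
    return b0 + f1(x[0]) + f2(x[1])

x = np.array([0.5, np.pi / 2])
contributions = {"intercept": b0, "x1": f1(x[0]), "x2": f2(x[1])}
# The prediction is exactly the sum of the per-feature contributions.
assert np.isclose(f(x), sum(contributions.values()))
```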

Discussion: Linear Model Interpretability

Post-hoc interpretability

Some common strategies for explaining black box models include:

  • Feature importances

  • Feature attributions

  • Model distillations

A few examples are given on the next few slides, but first we should distinguish between global and local explanations.

Global explanation

  • A global explanation characterizes how a model behaves across all possible inputs.

  • These explanations are valuable in scientific studies, where we usually look for universal rules relating sets of variables.

Example: Variable importance

According to this plot, the variables X4, X2, and X1 seem most important across three separate tree-based models (weeks 4 - 5).

Local explanation

  • A local explanation describes why a model made a specific prediction for a particular input. The relationships it finds may be unique to that example.

  • These are especially helpful in auditing high-stakes decisions made in specific cases, e.g. loan approvals, medical diagnoses, parole decisions.

Example: saliency map

In vision models, one use case of local explanations is to identify the parts of an image that are “most important” for particular predicted classes.
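A numerical sketch of the idea, using finite differences on a toy scoring function (real saliency maps use gradients of a trained network, computed analytically via backpropagation):

```python
import numpy as np

# Toy "model" that scores an image by the brightness of its top-left
# 2x2 quadrant; a stand-in for a class score from a vision model.
def model_score(img):
    return img[:2, :2].sum()

img = np.arange(16, dtype=float).reshape(4, 4)

# Finite-difference saliency: sensitivity of the score to each pixel.
eps = 1e-4
saliency = np.zeros_like(img)
for i in range(4):
    for j in range(4):
        bumped = img.copy()
        bumped[i, j] += eps
        saliency[i, j] = (model_score(bumped) - model_score(img)) / eps

# Only the top-left quadrant lights up: those pixels are the ones
# "most important" for this score.
```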

Example: SHAP attributions

For this sample, the most “important” feature is Capital Gain. It is responsible for much of the negative prediction \(f\left(x\right)\), in a sense that will be made precise when we discuss SHAP.
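As a preview of the idea, Shapley attributions can be computed exactly by brute force for a tiny model. This sketch uses a single baseline point to represent "missing" features, one simple convention; SHAP libraries average over a background dataset and use faster approximations.

```python
import itertools
from math import factorial
import numpy as np

def shapley_values(f, x, baseline):
    """Exact Shapley attributions by enumerating feature coalitions."""
    p = len(x)
    phi = np.zeros(p)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        for r in range(p):
            for S in itertools.combinations(others, r):
                # Shapley weight for a coalition of size r.
                w = factorial(r) * factorial(p - r - 1) / factorial(p)
                z = baseline.copy()
                z[list(S)] = x[list(S)]   # coalition features take x's values
                without_j = f(z)
                z[j] = x[j]               # add feature j to the coalition
                with_j = f(z)
                phi[j] += w * (with_j - without_j)
    return phi

model = lambda z: 3 * z[0] + z[1]  # toy linear model
phi = shapley_values(model, np.array([2.0, 1.0]), np.array([0.0, 0.0]))
print(phi)  # [6. 1.]: attributions sum to f(x) - f(baseline) = 7
```

For linear models the attribution of feature \(j\) reduces to its coefficient times its deviation from the baseline, which is what the enumeration recovers here.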

Interpretability-accuracy trade-offs

  • For many datasets, simple models (like decision trees) offer high descriptive accuracy but have lower predictive accuracy compared to more complex models (like random forests).

  • Similarly, deep neural networks may predict well but not be amenable to accurate description in the PDR (predictive accuracy, descriptive accuracy, relevancy) sense of Murdoch et al. (2019).

Bias and Variance

One intuition for this tradeoff comes from the bias-variance tradeoff: more complex models have lower bias, and when data are plentiful their higher variance is less of a concern, so complexity often wins on predictive accuracy.

Does it exist?

Unlike the bias-variance tradeoff, which has a precise mathematical foundation, the interpretability-accuracy tradeoff is more of a heuristic, and some have argued that it can be misleading (Rudin, 2019).

Exercise:

Applied systematically, interpretability techniques can be used to check assumptions, identify data and model quality issues, and uncover surprising relationships. The steps below expand on ideas in Murdoch et al. (2019).

%%{init: {'theme':'forest', 'themeVariables': {'fontSize':'36px', 'fontFamily':'arial'}, 'flowchart': {'padding': 80}}}%%
graph LR
    A[Design] --> B[Predictive <br/> Accuracy]
    B --> C[Stability]
    C --> D[Explanation <br/> Comparisons]
    D --> E[External <br/> Checks]

Data collection

  1. First consider the underlying data collection mechanism. This is essential context for any downstream interpretation.

  2. Were the data purely observational, and if so, what led to a sample being observed (or missed)?

  3. If it is from an experiment, which aspects were controlled? In what ways is the experimental system representative of the end-use?

Predictive accuracy

  1. Measure the model’s fit to the data using an appropriate performance metric on a held-out test set.

  2. The test set should mimic what the model will encounter during deployment as closely as possible, though this is difficult when the environment is constantly changing.

Interpretation/Explanation

Next we can apply our interpretability technique. We should pay attention to descriptive accuracy. For example,

  • How many features do we need before the SHAP approximation matches the full model prediction (Week 6 - 7)?

  • How closely does the distilled model match the original model’s predictions (Week 10)?

  • If using permutation importance, are the permuted data plausible, or do they lie outside the training data (Week 4)?
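One of these checks, distillation fidelity, can be sketched on synthetic data (illustrative setup, not the course's dataset): fit a shallow tree to a random forest's predictions, then measure descriptive accuracy as the student's \(R^2\) against the teacher.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.1, size=400)

teacher = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Distill: fit a shallow, interpretable tree to the teacher's predictions.
student = DecisionTreeRegressor(max_depth=3, random_state=0).fit(
    X, teacher.predict(X)
)
# Descriptive accuracy: how well the student mimics the teacher.
fidelity = student.score(X, teacher.predict(X))
print(f"distillation fidelity (R^2): {fidelity:.2f}")
```

A low fidelity warns that explanations read off the student may not describe the teacher.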

Stability

  1. Are the interpretations robust to perturbations?

  2. Data perturbations: Do the same interpretations appear when using different subsamples?

  3. Hyperparameter perturbations: Do the same interpretations appear across similar hyperparameter settings?

  4. Model perturbations: If using a post-hoc explanation, do the same features arise across multiple models?
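A data-perturbation check can be sketched with a lasso on synthetic data (toy setup, illustrative only): refit on random half-samples and record how often each feature is selected.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Refit on random half-samples and count feature selections.
n_reps = 20
counts = np.zeros(p)
for seed in range(n_reps):
    idx = np.random.default_rng(seed).choice(n, size=n // 2, replace=False)
    counts += Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_ != 0

selection_freq = counts / n_reps
# A stable interpretation: the truly relevant features (0 and 1) should
# be selected in (nearly) every refit.
```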

Compare interpretations

  1. Apply multiple interpretation methods to the same model.

  2. If methods give contradictory results, then at least one has low descriptive accuracy.
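A minimal sketch of such a comparison (synthetic data), applying two interpretations to the same random forest, impurity-based importance and permutation importance, and checking whether they agree on the top feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = 5 * X[:, 0] + rng.normal(scale=0.5, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# Interpretation 1: impurity-based importance (built into the forest).
imp_impurity = model.feature_importances_
# Interpretation 2: permutation importance on the same model.
imp_perm = permutation_importance(
    model, X, y, n_repeats=10, random_state=0
).importances_mean

# Agreement is reassuring; contradiction would mean at least one method
# describes the model poorly.
print(np.argmax(imp_impurity), np.argmax(imp_perm))
```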

External checks

Test interpretations against domain knowledge or experimental results. If the interpretation claims that feature \(j\) is important, this can be checked by:

  • comparing with existing scientific results
  • measuring performance when feature \(j\) is removed or held constant
  • running new experiments that manipulate feature \(j\).
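The second check, measuring performance with the feature removed, can be sketched on synthetic data (illustrative setup only): refit without the supposedly important feature and compare held-out scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 4))
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=400)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline held-out performance with all features.
full = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
score_full = full.score(X_te, y_te)

# Refit with the supposedly important feature (j = 0) removed.
reduced = RandomForestRegressor(n_estimators=100, random_state=0).fit(
    np.delete(X_tr, 0, axis=1), y_tr
)
score_reduced = reduced.score(np.delete(X_te, 0, axis=1), y_te)
# A large performance drop corroborates the importance claim.
print(f"R^2 with feature 0: {score_full:.2f}, without: {score_reduced:.2f}")
```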

Explanations for Communication

The talk (Kim, 2022) describes how interpretability bridges human and machine “concepts.”

This bridge is especially important in collaborative work! Data science never exists in a vacuum.